src/grammar for the canonical grammar of the Rust language #1331
Conversation
The semi-canonical step may be unnecessary. I don't expect there to be very many discrepancies between what we think the grammar is and what the grammar actually is, so we could just switch to the fully canonical stage directly. Even if a discrepancy is discovered after that point, all it would take would be an RFC to determine whether it should result in a change to the grammar or a change to the parser (obviously taking into account how much breakage would result from a change to the parser).
I'm hugely interested in this, many thanks! One question that bothers me is how consistency between the formal grammar and the actual parser would be maintained. And if we speak of a test suite, it would be ideal if it were implementation-independent and included in something like ctrs.
We have vestigial grammar tests still in the makefiles, I think.
I share @matklad's concerns. It'd be awesome to have a canonical grammar, but maintaining consistency seems like a major challenge. I'm not sure how we could actually go about testing it in a rigorous manner. Since the compiler's parser doesn't have to parse in the same way a generated parser (from the grammar) would, figuring out whether or not they are consistent with each other seems difficult.

That said, less-rigorous testing could be done with a combination of fuzzing and regression tests. The grammar could be used to generate syntactically correct (if nonsensical) code which is run through the compiler's parser. If a sample fails to parse, we can look at it and decide whether it's a "real" failure or not. If it is, we can fix the bug and use the sample (or a reduced version of it) as a regression test.

The downside is that this doesn't enforce the same parse: the compiler could accept a code fragment without error yet still parse it incorrectly. I can't see this being a problem in practice, though.

With all that said, some discussion of testing the grammar and parser would be nice to see in the RFC proper.
For reference: https://github.com/rust-lang/rust/blob/master/src/grammar/check.sh and https://github.com/rust-lang/rust/blob/master/src/grammar/verify.rs, and other stuff in src/grammar.
@steveklabnik I thought those were only lexer tests.
They at least used to be referred to as "grammar tests", though maybe I cleared some of them out and they're not there anymore. We were, like, using ANTLR and such...
|
> This process is sure to become ambiguous over time as syntax is increasingly adjusted (it is harder to "blame" syntax changes compared to syntax additions), therefore the resolution process of discrepancies will also depend more on a decision from the Rust team.
Nit: this should be the final decision of the language subteam, I think. Seems like the most logical decision making body for this sort of question. (And it's already what we do in practice.)
@nagisa I am very positive on this idea. For some time now I have been wanting to have grammar changes written up more rigorously. I think your plan is pretty coherent -- basically there's going to be a messy period where we have to make ad-hoc judgement calls about what the 'canonical' grammar is for a bit of time, but eventually we'll get to a clean state.

I do think it's important that we add automatic checking of the Rust grammar against the in-tree sources (at least). Ideally we'd maybe even do crater runs or something, but you gotta run before you can walk.

One question I am wondering about is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness. (Also, is one project based on the other?) I know that @rprichard has been filing various issues about discrepancies between what the grammar describes and what rustc actually accepts.
cc @rprichard @bleibig @dgrunwald as people who seem likely to have an opinion on this RFC and related matters.
As far as I can see, our in-tree LALR grammar is pretty much a copy of this project (sans the changes that have been made to the grammar since). I suspect, without much digging, that it was imported from the rust-grammar project itself.
I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there are no existing attempts to do this as far as I know. While that sounds like a nice way to test parser conformance, I’m not even sure how complex the implementation of such a tester would be. Answers to similar questions are also pretty hand-wavy.
ISTM that the Rust compiler, libs, and crates.io provide plenty of sample inputs :)
Different grammar productions can interact in surprising ways. For example, the operator
It's not obvious that this is even possible (and it probably shouldn't be), so I'd be surprised if you can find such code in the wild.
@nikomatsakis sample inputs will allow us to verify that the formal grammar accepts what rustc accepts. But even a trivial grammar can do this, namely the grammar of all token sequences. The hard problem is to make sure that the formal grammar does not accept invalid Rust; as I understand it, @Aatch's suggestion was about this direction of verification. And the second hard problem is to make sure that both the grammar and rustc produce similar parse trees: you can easily mix up operator precedence, for a start.
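As a small illustration of the precedence point (a hypothetical sketch in informal BNF, not taken from the actual Rust grammar): the two grammars below accept exactly the same token sequences, but only the second forces "*" to bind tighter than "+", so only it is guaranteed to agree with the compiler about the shape of the tree.

```
// Grammar A -- flat productions: accepts 1 + 2 * 3, but leaves the tree shape ambiguous.
expr : expr "+" expr
     | expr "*" expr
     | INT ;

// Grammar B -- precedence encoded in the productions: 1 + 2 * 3 can only parse as 1 + (2 * 3).
expr   : expr "+" term   | term ;
term   : term "*" factor | factor ;
factor : INT ;
```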
@dgrunwald @matklad yes, ok, right, forgot about @Aatch's question. Nonetheless, I think that comparing the two against all the Rust code we can get our hands on (and yes, comparing the parse trees, which seems eminently doable) provides a fairly high degree of confidence that we are parsing what we want to, despite the theoretical disparities. Certainly generating inputs would let us go further (though we'd presumably be filtering these inputs against the rustc impl, which still means that there may be cases where both do the wrong thing in the same way).
I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.
It is, of course. =)
To me it sounds like generating a parser is a natural step to take after we have a proper (canonical and verified-to-be-correct) grammar. We need some way to ensure the grammar matches the current implementation (and/or becomes the canonical representation of Rust's grammar) before we migrate rustc to a parser generator of any sort, don't we? However, that sounds like something that would need to be discussed once we have the generator itself ready, right?
Why would we migrate rustc to a generated parser? Good messages for syntax errors are so much easier in a hand-written recursive descent parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a (partial) syntax tree for code completion, you'll have to puzzle it together from the partially matched productions remaining on the parser's stack. Or (equivalently) introduce a whole bunch of error productions, and you'll always forget some and have cases where nodes aren't integrated into the syntax tree and code completion suffers. This stuff will be much, much easier with the existing hand-written parser (since it's LL-style, there's always an obvious choice for connecting the current node to the root of the tree when you encounter an error).
@dgrunwald Note that we did not necessarily say an LR parser. I was hoping
Careful there; you seem to be assuming that it will be possible to specify an unambiguous formal grammar that matches rustc's behavior (modulo rustc bugs that we can fix without breaking the world). I'm not sure this is possible. Some things are easy to specify in English (and in a hand-written parser), but hard in a formal grammar. For example, take this small grammar, which looks like a tiny subset of Rust's expression grammar (allowing integer literals, the "as" operator, and the less-than operator):
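(The snippet below is a plausible reconstruction, in informal BNF, of the kind of fragment being described rather than the exact original productions.)

```
expr : INT
     | expr "as" ty       // cast
     | expr "<" expr ;    // comparison

ty   : ID
     | ID "<" ty ">" ;    // generic type, e.g. Vec<u8>
```

Under such a grammar, something like `1 as usize < 2` derives as a comparison whose left operand is a cast, whereas rustc instead treats the `<` that follows the type name as the start of generic arguments and only accepts the parenthesized form `(1 as usize) < 2`.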
But this grammar already accepts some expressions that rustc does not: e.g. ones where a '<' directly follows the type of a cast and the grammar reads it as a comparison. It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:
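Again as a sketch (a plausible reconstruction, not the original snippet): encoding the English rule means splitting the expression productions according to whether an expression ends in a bare type name, so that such an expression can never stand directly to the left of a "<".

```
// Expressions ending in a bare type name (i.e. in a cast "... as ID");
// these may not appear immediately to the left of a "<".
expr_tn  : expr "as" ID
         | expr_ntn "<" expr_tn ;

// Expressions NOT ending in a bare type name.
expr_ntn : INT
         | expr "as" ID "<" ty ">"
         | expr_ntn "<" expr_ntn ;

expr : expr_tn | expr_ntn ;

ty   : ID
     | ID "<" ty ">" ;
```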
Note that we just about doubled the number of productions! A few more such disambiguation rules and the grammar grows exponentially in size; it becomes unreadable and unmaintainable. Keeping the disambiguation rule in plain English is much better. This is also the option taken in the C# language specification: the formal grammar in the appendix is just a rough approximation, and sections like "7.6.4.2 Grammar ambiguities" explain how the parser should handle specific tricky cases. But if the formal grammar is just an approximation, that makes automated consistency checks rather tricky...
We'll have to see how it goes. I'll note though that LALRPOP's macros
FWIW, the
(Nominating for final comment period discussion, since it seems like we've reached a fixed point here in this comment thread.)
Hear ye, hear ye! This RFC is now entering final comment period.
This is potentially very far off in the future, but would having a canonical grammar like this make it easier for people to develop alternate implementations?
It would certainly help, though probably not by THAT much.
Huzzah! This RFC has been accepted by the language design subteam.
Rendered